Attention on Attention: Architectures for Visual Question Answering (VQA)
Authors
Abstract
Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring the coordination of natural language processing and computer vision modules within a single architecture. We build upon the model that placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. After 300 GPU hours of extensive hyperparameter and architecture search, we achieve an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%. The code is available at github.com/SinghJasdeep/Attention-on-Attention-for-VQA.
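As a rough illustration of the kind of mechanism this family of models builds on, the following PyTorch sketch applies question-guided soft attention over image region features and feeds the fused representation to a simple classifier head. The layer sizes, the tanh fusion, and the AttentionVQA name are illustrative assumptions, not the paper's exact design:

# A minimal sketch of question-guided soft attention over image region
# features, followed by a simple classifier head. All dimensions and the
# fusion choices are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQA(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question encoding
        self.att = nn.Linear(hid, 1)          # scalar attention logit per region
        self.classifier = nn.Sequential(      # simplified classifier head
            nn.Linear(hid, hid), nn.ReLU(), nn.Linear(hid, n_answers))

    def forward(self, v, q):
        # v: (batch, n_regions, v_dim) image region features
        # q: (batch, q_dim) question encoding (e.g., last LSTM state)
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.att(joint), dim=1)     # (batch, n_regions, 1)
        v_att = (alpha * self.v_proj(v)).sum(dim=1)   # attended visual feature
        fused = v_att * torch.tanh(self.q_proj(q))    # element-wise fusion
        return self.classifier(fused)                 # answer logits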
Similar resources
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
The problem of Visual Question Answering (VQA) requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on recurrent LSTM networks to this problem, but have failed to model spatial inference. In this paper, we propose a memory network with spatial attention for the VQA task. Memory networks ...
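The following PyTorch sketch illustrates one hop of question-guided spatial attention in the spirit of the memory-network formulation above: each cell of a convolutional feature map is scored against the question encoding, and the attended evidence updates the query. The single-hop setup, layer sizes, and names are assumptions for illustration, not the paper's multi-hop design:

# A rough sketch of one hop of question-guided spatial attention over a
# CNN feature map treated as memory. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionHop(nn.Module):
    def __init__(self, c=512, q_dim=512):
        super().__init__()
        self.q_to_key = nn.Linear(q_dim, c)  # map question into feature space

    def forward(self, fmap, q):
        # fmap: (batch, c, h, w) convolutional feature map ("memory")
        # q:    (batch, q_dim) question encoding
        b, c, h, w = fmap.shape
        cells = fmap.view(b, c, h * w).transpose(1, 2)  # (b, h*w, c)
        key = self.q_to_key(q).unsqueeze(2)             # (b, c, 1)
        scores = torch.bmm(cells, key).squeeze(2)       # dot product per cell
        alpha = F.softmax(scores, dim=1).unsqueeze(2)   # (b, h*w, 1)
        evidence = (alpha * cells).sum(dim=1)           # attended evidence
        return evidence + self.q_to_key(q)              # update the query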
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and an image-related question, VQA returns a natural language answer. Since different questions inquire about the attributes of different image regions, generating correct answers requires the model to have question-guided attention, i.e., the attention on the regions correspond...
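A hedged PyTorch sketch of the question-guided attention idea: the question embedding is turned into a convolution kernel that is slid over the image feature map to produce an attention map, in the spirit of ABC-CNN's configurable convolution. The kernel size, dimensions, and names are illustrative assumptions rather than the paper's exact settings:

# A sketch of question-configured convolutional attention: each example's
# question produces its own kernel, which scores spatial locations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConfiguredAttention(nn.Module):
    def __init__(self, c=512, q_dim=512, k=3):
        super().__init__()
        self.c, self.k = c, k
        self.make_kernel = nn.Linear(q_dim, c * k * k)  # question -> conv kernel

    def forward(self, fmap, q):
        # fmap: (batch, c, h, w) image feature map; q: (batch, q_dim)
        b, c, h, w = fmap.shape
        kernel = self.make_kernel(q).view(b, 1, c, self.k, self.k)
        maps = []
        for i in range(b):  # per-example convolution with its own kernel
            maps.append(F.conv2d(fmap[i:i + 1], kernel[i], padding=self.k // 2))
        att = torch.cat(maps, dim=0)                    # (batch, 1, h, w)
        att = F.softmax(att.view(b, -1), dim=1).view(b, 1, h, w)
        return fmap * att                               # reweight the feature map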
Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering
Recently, the Visual Question Answering (VQA) task has gained increasing attention in artificial intelligence. Existing VQA methods mainly adopt the visual attention mechanism to associate the input question with corresponding image regions for effective question answering. The free-form region-based and the detection-based visual attention mechanisms are the most commonly investigated, with the former one...
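A minimal PyTorch sketch of multi-modal multiplicative feature embedding: question and (attended) visual features are projected into a shared space and fused by an element-wise product. The dimensions, the tanh nonlinearity, and the names are assumptions for illustration:

# A sketch of multiplicative (Hadamard-product) feature fusion for VQA.
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, joint=1024):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, joint)  # visual branch projection
        self.q_proj = nn.Linear(q_dim, joint)  # question branch projection

    def forward(self, v, q):
        # v: (batch, v_dim) attended visual feature (free-form region or detection)
        # q: (batch, q_dim) question encoding
        return torch.tanh(self.v_proj(v)) * torch.tanh(self.q_proj(q))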
Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate...
Dual Attention Network for Visual Question Answering
Visual Question Answering (VQA) is a popular research problem that involves inferring answers to natural language questions about a given visual scene. Recent neural network approaches to VQA use attention to select relevant image features based on the question. In this paper, we propose a novel Dual Attention Network (DAN) that not only attends to image features, but also to question features....
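A rough PyTorch sketch of the dual-attention idea: a shared memory vector guides attention over both image regions and question words, and is refined from the two attended summaries. The single refinement step, sizes, and names are illustrative assumptions, not DAN's exact formulation:

# A sketch of one dual-attention step: one memory vector attends over
# visual features and textual features, then updates itself from both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.v_att = nn.Linear(d, 1)           # visual attention scorer
        self.q_att = nn.Linear(d, 1)           # textual attention scorer
        self.update = nn.Linear(2 * d, d)      # memory refinement

    def forward(self, v, u, m):
        # v: (batch, n_regions, d) image features; u: (batch, n_words, d)
        # question word features; m: (batch, d) shared memory vector
        av = F.softmax(self.v_att(torch.tanh(v * m.unsqueeze(1))), dim=1)
        aq = F.softmax(self.q_att(torch.tanh(u * m.unsqueeze(1))), dim=1)
        v_sum = (av * v).sum(dim=1)            # attended visual summary
        q_sum = (aq * u).sum(dim=1)            # attended textual summary
        return m + self.update(torch.cat([v_sum, q_sum], dim=1))  # refine memory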